* Defining the Problems
* Checking versions of libraries
* Importing dataset
* Data understanding
* Pre-processing and cleaning of data
* Summarizing the dataset to find actionable insights
* Visualization of data
Feature Descriptions ⇣
App - Name of the app
Category - Category of the app, for example ART_AND_DESIGN, FINANCE, COMICS, BEAUTY
Rating - The current average rating (out of 5) of the app on Google Play
Reviews - Number of user reviews given on the app
Size - Size of the app in MB (megabytes)
Installs - Number of times the app was downloaded from Google Play
Type - Whether the app is paid or free
Price - Price of the app in US$
Last_Updated - Date on which the app was last updated on Google Play Store
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
import plotly.colors as colors
pio.templates.default = "plotly_white"
import os
import warnings
warnings.filterwarnings('ignore')
from plotly.subplots import make_subplots
# palplot draws the palette and returns None, so there is nothing to assign
sns.palplot(['hotpink', 'white', 'yellow'])
plt.title("Play Store Palette", loc='left', fontsize=10, y=1)
plt.show()
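The outline lists a library-version check, but no cell shows one. A minimal sketch using the standard library's `importlib.metadata` could look like this; the helper name `library_versions` is hypothetical, and the package list simply mirrors the imports used in this notebook.

```python
from importlib import metadata


def library_versions(names):
    """Return {package name: installed version, or None if not installed}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in this environment
            versions[name] = None
    return versions


# Package names assumed from the import cell of this notebook
for package, version in library_versions(
    ["pandas", "numpy", "seaborn", "matplotlib", "plotly"]
).items():
    print(f"{package}: {version}")
```

Recording the versions next to the results makes it easier to explain differences when the notebook is rerun later.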
Importing Data ⇣
data = pd.read_csv('googleplaystore.csv')
data[:4]
| | App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1 | 159 | 19M | 10,000+ | Free | 0 | Everyone | Art & Design | January 7, 2018 | 1.0.0 | 4.0.3 and up |
| 1 | Coloring book moana | ART_AND_DESIGN | 3.9 | 967 | 14M | 500,000+ | Free | 0 | Everyone | Art & Design;Pretend Play | January 15, 2018 | 2.0.0 | 4.0.3 and up |
| 2 | U Launcher Lite – FREE Live Cool Themes, Hide ... | ART_AND_DESIGN | 4.7 | 87510 | 8.7M | 5,000,000+ | Free | 0 | Everyone | Art & Design | August 1, 2018 | 1.2.4 | 4.0.3 and up |
| 3 | Sketch - Draw & Paint | ART_AND_DESIGN | 4.5 | 215644 | 25M | 50,000,000+ | Free | 0 | Teen | Art & Design | June 8, 2018 | Varies with device | 4.2 and up |
data.columns = data.columns.str.replace(" ","_")
data.isna().sum().to_frame()
| | 0 |
|---|---|
| App | 0 |
| Category | 0 |
| Rating | 1474 |
| Reviews | 0 |
| Size | 0 |
| Installs | 0 |
| Type | 1 |
| Price | 0 |
| Content_Rating | 1 |
| Genres | 0 |
| Last_Updated | 0 |
| Current_Ver | 8 |
| Android_Ver | 3 |
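# Fill missing ratings with the median, then drop the few rows that still have missing values
rating_median = data["Rating"].median()
# Assign the result back; inplace=True on a selected column is unreliable in recent pandas
data["Rating"] = data["Rating"].fillna(rating_median)
data.dropna(inplace = True)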
data.info()
<class 'pandas.core.frame.DataFrame'>
Index: 10829 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   App             10829 non-null  object
 1   Category        10829 non-null  object
 2   Rating          10829 non-null  float64
 3   Reviews         10829 non-null  object
 4   Size            10829 non-null  object
 5   Installs        10829 non-null  object
 6   Type            10829 non-null  object
 7   Price           10829 non-null  object
 8   Content_Rating  10829 non-null  object
 9   Genres          10829 non-null  object
 10  Last_Updated    10829 non-null  object
 11  Current_Ver     10829 non-null  object
 12  Android_Ver     10829 non-null  object
dtypes: float64(1), object(12)
memory usage: 1.2+ MB
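The missing-value handling used here (median fill for Rating, then dropping the remaining incomplete rows) can be tried on a small frame first. This is a minimal sketch with made-up values, not the real dataset.

```python
import pandas as pd

# Toy frame (made-up values) with gaps in two columns
toy = pd.DataFrame({
    "Rating": [4.0, None, 5.0, 3.0],
    "Type": ["Free", "Paid", None, "Free"],
})

# Median of the observed ratings; missing values are ignored by default
rating_median = toy["Rating"].median()

# Assign the filled column back instead of using inplace=True
toy["Rating"] = toy["Rating"].fillna(rating_median)

# Drop rows that still have a missing value in any column
cleaned = toy.dropna()
print(cleaned)
```

Only the row with a missing Type is dropped, while the row with a missing Rating is kept because it was filled first.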
data["Reviews"] = data["Reviews"].astype("int64")
data.Installs.unique()
array(['10,000+', '500,000+', '5,000,000+', '50,000,000+', '100,000+',
'50,000+', '1,000,000+', '10,000,000+', '5,000+', '100,000,000+',
'1,000,000,000+', '1,000+', '500,000,000+', '50+', '100+', '500+',
'10+', '1+', '5+', '0+'], dtype=object)
data.Installs = data.Installs.apply(lambda x:x.replace("+",""))
data.Installs = data.Installs.apply(lambda x:x.replace(",",""))
data.Installs = data.Installs.apply(lambda x:int(x))
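The three `apply` calls can also be written as one vectorized step. A minimal sketch on a few sample strings taken from the unique values listed for Installs; the variable names are illustrative only.

```python
import pandas as pd

# A few of the raw Installs strings
installs = pd.Series(["10,000+", "0+", "1,000,000,000+"])

# Remove every "+" and "," with one character-class pattern, then cast to integers
installs_clean = installs.str.replace(r"[+,]", "", regex=True).astype("int64")
print(installs_clean.tolist())
```

Using `int64` explicitly keeps the largest bucket (one billion) safely within range on every platform.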
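The Price column has the same problem: its values are stored as strings (with a leading $), so they have to be cleaned before they can be converted to numbers.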
data['Price'].unique()
array(['0', '$4.99', '$3.99', '$6.99', '$1.49', '$2.99', '$7.99', '$5.99',
'$3.49', '$1.99', '$9.99', '$7.49', '$0.99', '$9.00', '$5.49',
'$10.00', '$24.99', '$11.99', '$79.99', '$16.99', '$14.99',
'$1.00', '$29.99', '$12.99', '$2.49', '$10.99', '$1.50', '$19.99',
'$15.99', '$33.99', '$74.99', '$39.99', '$3.95', '$4.49', '$1.70',
'$8.99', '$2.00', '$3.88', '$25.99', '$399.99', '$17.99',
'$400.00', '$3.02', '$1.76', '$4.84', '$4.77', '$1.61', '$2.50',
'$1.59', '$6.49', '$1.29', '$5.00', '$13.99', '$299.99', '$379.99',
'$37.99', '$18.99', '$389.99', '$19.90', '$8.49', '$1.75',
'$14.00', '$4.85', '$46.99', '$109.99', '$154.99', '$3.08',
'$2.59', '$4.80', '$1.96', '$19.40', '$3.90', '$4.59', '$15.46',
'$3.04', '$4.29', '$2.60', '$3.28', '$4.60', '$28.99', '$2.95',
'$2.90', '$1.97', '$200.00', '$89.99', '$2.56', '$30.99', '$3.61',
'$394.99', '$1.26', '$1.20', '$1.04'], dtype=object)
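# regex=False makes sure "$" is treated as a literal character, not as a regex anchor
data['Price'] = pd.to_numeric(data['Price'].str.replace('$', '', regex=False))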
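Because `$` is also a regular-expression metacharacter (end of string), being explicit about literal matching removes any doubt about what gets replaced. A minimal sketch on three of the price strings listed above:

```python
import pandas as pd

# Sample of the raw Price strings
prices = pd.Series(["0", "$4.99", "$399.99"])

# Strip the literal dollar sign, then convert the strings to floats
prices_clean = pd.to_numeric(prices.str.replace("$", "", regex=False))
print(prices_clean.tolist())
```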
data["Genres"] = data["Genres"].str.split(";").str[0]
data.Genres.unique()
array(['Art & Design', 'Auto & Vehicles', 'Beauty', 'Books & Reference',
'Business', 'Comics', 'Communication', 'Dating', 'Education',
'Entertainment', 'Events', 'Finance', 'Food & Drink',
'Health & Fitness', 'House & Home', 'Libraries & Demo',
'Lifestyle', 'Adventure', 'Arcade', 'Casual', 'Card', 'Action',
'Strategy', 'Puzzle', 'Sports', 'Music', 'Word', 'Racing',
'Simulation', 'Board', 'Trivia', 'Role Playing', 'Educational',
'Music & Audio', 'Video Players & Editors', 'Medical', 'Social',
'Shopping', 'Photography', 'Travel & Local', 'Tools',
'Personalization', 'Productivity', 'Parenting', 'Weather',
'News & Magazines', 'Maps & Navigation', 'Casino'], dtype=object)
data["Size"].unique()
array(['19M', '14M', '8.7M', '25M', '2.8M', '5.6M', '29M', '33M', '3.1M',
'28M', '12M', '20M', '21M', '37M', '5.5M', '17M', '39M', '31M',
'4.2M', '7.0M', '23M', '6.0M', '6.1M', '4.6M', '9.2M', '5.2M',
'11M', '24M', 'Varies with device', '9.4M', '15M', '10M', '1.2M',
'26M', '8.0M', '7.9M', '56M', '57M', '35M', '54M', '201k', '3.6M',
'5.7M', '8.6M', '2.4M', '27M', '2.7M', '2.5M', '16M', '3.4M',
'8.9M', '3.9M', '2.9M', '38M', '32M', '5.4M', '18M', '1.1M',
'2.2M', '4.5M', '9.8M', '52M', '9.0M', '6.7M', '30M', '2.6M',
'7.1M', '3.7M', '22M', '7.4M', '6.4M', '3.2M', '8.2M', '9.9M',
'4.9M', '9.5M', '5.0M', '5.9M', '13M', '73M', '6.8M', '3.5M',
'4.0M', '2.3M', '7.2M', '2.1M', '42M', '7.3M', '9.1M', '55M',
'23k', '6.5M', '1.5M', '7.5M', '51M', '41M', '48M', '8.5M', '46M',
'8.3M', '4.3M', '4.7M', '3.3M', '40M', '7.8M', '8.8M', '6.6M',
'5.1M', '61M', '66M', '79k', '8.4M', '118k', '44M', '695k', '1.6M',
'6.2M', '18k', '53M', '1.4M', '3.0M', '5.8M', '3.8M', '9.6M',
'45M', '63M', '49M', '77M', '4.4M', '4.8M', '70M', '6.9M', '9.3M',
'10.0M', '8.1M', '36M', '84M', '97M', '2.0M', '1.9M', '1.8M',
'5.3M', '47M', '556k', '526k', '76M', '7.6M', '59M', '9.7M', '78M',
'72M', '43M', '7.7M', '6.3M', '334k', '34M', '93M', '65M', '79M',
'100M', '58M', '50M', '68M', '64M', '67M', '60M', '94M', '232k',
'99M', '624k', '95M', '8.5k', '41k', '292k', '80M', '1.7M', '74M',
'62M', '69M', '75M', '98M', '85M', '82M', '96M', '87M', '71M',
'86M', '91M', '81M', '92M', '83M', '88M', '704k', '862k', '899k',
'378k', '266k', '375k', '1.3M', '975k', '980k', '4.1M', '89M',
'696k', '544k', '525k', '920k', '779k', '853k', '720k', '713k',
'772k', '318k', '58k', '241k', '196k', '857k', '51k', '953k',
'865k', '251k', '930k', '540k', '313k', '746k', '203k', '26k',
'314k', '239k', '371k', '220k', '730k', '756k', '91k', '293k',
'17k', '74k', '14k', '317k', '78k', '924k', '902k', '818k', '81k',
'939k', '169k', '45k', '475k', '965k', '90M', '545k', '61k',
'283k', '655k', '714k', '93k', '872k', '121k', '322k', '1.0M',
'976k', '172k', '238k', '549k', '206k', '954k', '444k', '717k',
'210k', '609k', '308k', '705k', '306k', '904k', '473k', '175k',
'350k', '383k', '454k', '421k', '70k', '812k', '442k', '842k',
'417k', '412k', '459k', '478k', '335k', '782k', '721k', '430k',
'429k', '192k', '200k', '460k', '728k', '496k', '816k', '414k',
'506k', '887k', '613k', '243k', '569k', '778k', '683k', '592k',
'319k', '186k', '840k', '647k', '191k', '373k', '437k', '598k',
'716k', '585k', '982k', '219k', '55k', '948k', '323k', '691k',
'511k', '951k', '963k', '25k', '554k', '351k', '27k', '82k',
'208k', '913k', '514k', '551k', '29k', '103k', '898k', '743k',
'116k', '153k', '209k', '353k', '499k', '173k', '597k', '809k',
'122k', '411k', '400k', '801k', '787k', '50k', '643k', '986k',
'97k', '516k', '837k', '780k', '961k', '269k', '20k', '498k',
'600k', '749k', '642k', '881k', '72k', '656k', '601k', '221k',
'228k', '108k', '940k', '176k', '33k', '663k', '34k', '942k',
'259k', '164k', '458k', '245k', '629k', '28k', '288k', '775k',
'785k', '636k', '916k', '994k', '309k', '485k', '914k', '903k',
'608k', '500k', '54k', '562k', '847k', '957k', '688k', '811k',
'270k', '48k', '329k', '523k', '921k', '874k', '981k', '784k',
'280k', '24k', '518k', '754k', '892k', '154k', '860k', '364k',
'387k', '626k', '161k', '879k', '39k', '970k', '170k', '141k',
'160k', '144k', '143k', '190k', '376k', '193k', '246k', '73k',
'992k', '253k', '420k', '404k', '470k', '226k', '240k', '89k',
'234k', '257k', '861k', '467k', '157k', '44k', '676k', '67k',
'552k', '885k', '1020k', '582k', '619k'], dtype=object)
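# Convert every size to megabytes; "Varies with device" cannot be parsed and becomes NaN
size_num = pd.to_numeric(data["Size"].str.replace("[Mk]", "", regex=True), errors="coerce")
# Values ending in "k" are kilobytes, so divide them by 1024 (assuming 1 MB = 1024 kB)
size_num = size_num.where(~data["Size"].str.endswith("k"), size_num / 1024)
# Fill the unknown sizes with the median size
size_median = size_num.median()
data["Size"] = size_num.fillna(size_median)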
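The Size strings mix two units ("M" and "k"), so a conversion that keeps the unit consistent is worth checking on a few values. A minimal sketch, assuming 1 MB = 1024 kB; the helper `size_to_mb` is hypothetical and not part of the original notebook.

```python
import pandas as pd


def size_to_mb(value: str) -> float:
    """Return the size in megabytes, or NaN when the text cannot be parsed."""
    if value.endswith("M"):
        return float(value[:-1])
    if value.endswith("k"):
        return float(value[:-1]) / 1024  # assumption: 1 MB = 1024 kB
    return float("nan")  # for example "Varies with device"


sizes = pd.Series(["19M", "201k", "Varies with device", "8.7M"])
size_mb = sizes.map(size_to_mb)

# Replace the unknown sizes with the median of the known ones
size_mb = size_mb.fillna(size_mb.median())
print(size_mb.tolist())
```

Without the unit conversion, an app of 201 kB would be counted as 201 MB, which would distort both the median and any later size comparison.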
data['Last_Updated']=pd.to_datetime(data['Last_Updated'])
data[:3]
| | App | Category | Rating | Reviews | Size | Installs | Type | Price | Content_Rating | Genres | Last_Updated | Current_Ver | Android_Ver |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1 | 159 | 19.0 | 10000 | Free | 0.0 | Everyone | Art & Design | 2018-01-07 | 1.0.0 | 4.0.3 and up |
| 1 | Coloring book moana | ART_AND_DESIGN | 3.9 | 967 | 14.0 | 500000 | Free | 0.0 | Everyone | Art & Design | 2018-01-15 | 2.0.0 | 4.0.3 and up |
| 2 | U Launcher Lite – FREE Live Cool Themes, Hide ... | ART_AND_DESIGN | 4.7 | 87510 | 8.7 | 5000000 | Free | 0.0 | Everyone | Art & Design | 2018-08-01 | 1.2.4 | 4.0.3 and up |
fig = px.histogram(data, x='Type', color='Type',color_discrete_sequence= px.colors.qualitative.Alphabet_r
)
fig.update_layout(
title='Free and Paid',
xaxis=dict(title='Type'),
yaxis=dict(title='Number of apps'),
width=400,height=400,
bargap=0.2
)
fig.show()
Free apps far outnumber paid apps.
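The same comparison can be read off as numbers with `value_counts`. A minimal sketch on made-up labels; on the real data the Series would be `data["Type"]`.

```python
import pandas as pd

# Made-up Type labels standing in for data["Type"]
types = pd.Series(["Free", "Free", "Free", "Paid"], name="Type")

counts = types.value_counts()                 # absolute number of apps per type
shares = types.value_counts(normalize=True)   # fraction of apps per type

print(counts)
print(shares)
```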
values = data.Type.value_counts().values
labels = data.Type.value_counts().index
fig = px.pie(data,
values=values,
names=labels,
title='Type',
hole=0.5,
color_discrete_sequence= px.colors.qualitative.Alphabet_r
)
fig.update_traces(textposition='auto', textinfo='percent+label')
fig.update_layout(title_text = 'App Distribution ',title_font= dict(size= 24),
width=700, height=400
)
fig.show()
fig = px.box(data, x='Type', y='Rating', color='Type', color_discrete_sequence=px.colors.qualitative.Alphabet_r
)
fig.update_layout(
title='Comparison of Rating Free Vs Paid',
xaxis=dict(title='Type'),
yaxis=dict(title='Rating'),
width=700,height=400
)
fig.show()
Paid apps are rated slightly higher than free apps. Note that the line inside each box is the median, not the mean.
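A box plot shows medians and quartiles, so a claim about averages is easier to confirm with a grouped summary. A minimal sketch with made-up ratings; the column names match the notebook, the values do not come from the dataset.

```python
import pandas as pd

# Made-up ratings for illustration only
toy = pd.DataFrame({
    "Type": ["Free", "Free", "Free", "Paid", "Paid"],
    "Rating": [4.0, 4.2, 4.4, 4.5, 4.7],
})

# Mean, median and number of apps for each type
summary = toy.groupby("Type")["Rating"].agg(["mean", "median", "count"])
print(summary)
```

Including the count matters here because the paid group is much smaller, so its average is more sensitive to a few apps.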
fig = px.histogram(data, x='Content_Rating', color='Content_Rating',color_discrete_sequence= px.colors.qualitative.Alphabet_r
)
fig.update_layout(
title='Content_Rating',
xaxis=dict(title='Content_Rating'),
yaxis=dict(title='Number of apps'),
width=700,height=400,
bargap=0.2
)
fig.show()
Most of the apps are in the Everyone category, followed by Teen and Everyone 10+.
fig = px.box(data, x='Content_Rating', y='Rating', color='Content_Rating', color_discrete_sequence=px.colors.qualitative.Alphabet_r
)
fig.update_layout(
title='Comparison of Rating VS Content Rating',
xaxis=dict(title='Content Rating'),
yaxis=dict(title='Rating'),
width=700,height=400
)
fig.show()
The ratings are almost the same for Everyone, Teen and Everyone 10+. Everyone has more outliers than the other categories, Mature 17+ has the lowest average rating, and the 18+ category has the highest.
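The outlier observation can be approximated numerically with the usual 1.5 × IQR rule. A minimal sketch on made-up ratings; the helper `count_iqr_outliers` is hypothetical, and the quartile convention used by pandas may differ from the one used by the plotting library, so the counts can differ slightly from what the box plot shows.

```python
import pandas as pd


def count_iqr_outliers(ratings: pd.Series) -> int:
    """Count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1 = ratings.quantile(0.25)
    q3 = ratings.quantile(0.75)
    iqr = q3 - q1
    outside = (ratings < q1 - 1.5 * iqr) | (ratings > q3 + 1.5 * iqr)
    return int(outside.sum())


# Made-up ratings for illustration only
toy = pd.DataFrame({
    "Content_Rating": ["Everyone"] * 6 + ["Teen"] * 4,
    "Rating": [4.0, 4.1, 4.2, 4.3, 4.4, 1.0, 4.0, 4.1, 4.2, 4.3],
})

outliers = toy.groupby("Content_Rating")["Rating"].apply(count_iqr_outliers)
print(outliers)
```

A larger group will usually show more outliers simply because it contains more apps, so comparing outlier counts across groups of very different size needs some care.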
top_15_Category = data['Category'].value_counts().sort_values(ascending=False).head(15)
fig = px.bar(y=top_15_Category.index, x=top_15_Category.values, orientation='h', color_discrete_sequence=px.colors.qualitative.Alphabet_r
)
fig.update_layout(
title='Number of Apps per Category (Top 15)',
xaxis=dict(title='Counts'),
yaxis=dict(title='Category'),
yaxis_categoryorder='total ascending',
width = 700, height= 500
)
fig.show()
Most apps are in the Family category, followed by Games and Tools.
fig = px.scatter(data, x='Category', y='Price', color='Category', color_discrete_sequence=px.colors.qualitative.Alphabet_r
)
fig.update_layout(
title='Category & Price',
width=700, height=500,
xaxis=dict(title='Category', title_font=dict(color='white'), tickfont=dict(color='white')),
yaxis=dict(title='Price', title_font=dict(color='white'), tickfont=dict(color='white')),
plot_bgcolor='grey',
paper_bgcolor='black',
legend=dict(title='Category', font=dict(color='white')),
title_font=dict(color='white')
)
fig.show()
The most expensive apps are in the Finance, Lifestyle and Family categories.
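The scatter plot shows every app, so the top price in each category is easier to read from a grouped maximum. A minimal sketch with made-up rows; column names match the notebook.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "Category": ["FINANCE", "FINANCE", "LIFESTYLE", "TOOLS"],
    "Price": [399.99, 0.0, 400.0, 4.99],
})

# Highest price per category, most expensive category first
max_price = toy.groupby("Category")["Price"].max().sort_values(ascending=False)
print(max_price)
```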
fig = px.histogram(data, x='Rating', nbins=20, marginal='rug', opacity=1, color_discrete_sequence=px.colors.qualitative.Alphabet_r)
fig.update_layout(
title="Histogram with Rug Plot for the Rating Column",
xaxis_title="Ratings",
yaxis_title="Counts",
# paper_bgcolor='rgba(0,0,0,0)',
# plot_bgcolor='rgba(0,0,0,0)',
# font_color='white',
width= 700 , height = 400
)
fig.show()
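Most ratings are concentrated around 4.3. Note that the missing ratings filled with the median earlier all fall into a single bar, which can exaggerate the peak.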
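Since missing ratings were filled with the median, it helps to know how much of the peak is imputed. A minimal sketch with made-up ratings that records which values were missing before filling; the variable names are illustrative only.

```python
import pandas as pd

# Made-up ratings with two missing values
ratings = pd.Series([4.3, 4.3, 4.5, None, None, 3.9], name="Rating")

# Remember which entries were missing before filling them
was_missing = ratings.isna()
filled = ratings.fillna(ratings.median())

# Most frequent rating with and without the imputed values
most_common_all = filled.value_counts().idxmax()
most_common_observed = ratings.dropna().value_counts().idxmax()
share_imputed = was_missing.mean()

print(most_common_all, most_common_observed, round(share_imputed, 3))
```

Plotting only the observed ratings next to the filled ones would show directly how much the imputation changes the shape of the histogram.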
data[:1]
| | App | Category | Rating | Reviews | Size | Installs | Type | Price | Content_Rating | Genres | Last_Updated | Current_Ver | Android_Ver |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1 | 159 | 19.0 | 10000 | Free | 0.0 | Everyone | Art & Design | 2018-01-07 | 1.0.0 | 4.0.3 and up |
top_15_genres = data['Genres'].value_counts().sort_values(ascending=False).head(15)
fig = px.bar(y=top_15_genres.index, x=top_15_genres.values, orientation='h', color_discrete_sequence=px.colors.qualitative.Alphabet_r
)
fig.update_layout(
title='Number of Apps per Genre (Top 15)',
xaxis=dict(title='Counts'),
yaxis=dict(title='Genres'),
yaxis_categoryorder='total ascending',
width = 700, height= 500
)
fig.show()
The most common genres are Tools, Entertainment and Education, followed by the others.
from wordcloud import WordCloud
text = " ".join(data.App)
word_cloud1 = WordCloud(collocations = False, background_color = 'black',
width = 1200, height = 1080,colormap='Set1').generate(text)
plt.figure(figsize=(18,8))
plt.imshow(word_cloud1, interpolation='bilinear')
plt.axis("off")
plt.title("Most Frequent Words in App Names", size=20)
plt.show()
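A word cloud sizes words by how often they occur in the app names, so the same information can be listed as plain counts. A minimal sketch using only the standard library on made-up names; the helper `top_words` is hypothetical, and no stop-word filtering is applied here.

```python
import re
from collections import Counter


def top_words(names, n=3):
    """Return the n most frequent lowercase words across all names."""
    words = []
    for name in names:
        # Keep alphabetic runs only, ignoring case, digits and punctuation
        words.extend(re.findall(r"[a-z]+", name.lower()))
    return Counter(words).most_common(n)


# Made-up app names for illustration only
sample_names = ["Photo Editor", "Photo Collage", "Free Photo Editor"]
print(top_words(sample_names, n=2))
```

On the real data the input would be `data.App`; a list of counts is easier to quote in a report than the relative font sizes of a word cloud.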
Prem Mandal....
Let your happiness shine through! 😊